Linear regression homework with Yelp votes

Introduction

This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.

Description of the data:

  • yelp.json is the original format of the file. yelp.csv contains the same data, in a more convenient format. Both of the files are in this repo, so there is no need to download the data from the Kaggle website.
  • Each observation in this dataset is a review of a particular business by a particular user.
  • The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
  • The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
  • The "useful" and "funny" columns are similar to the "cool" column.

Task 1

Read yelp.csv into a DataFrame.


In [12]:
# access yelp.csv using a relative path
import pandas as pd
yelp = pd.read_csv('../data/yelp.csv')
yelp.head(1)


Out[12]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0

Task 1 (Bonus)

Ignore the yelp.csv file, and construct this DataFrame yourself from yelp.json. This involves reading the data into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.


In [1]:
# read the data from yelp.json into a list named "data", one element per row
# each row is decoded into a dictionary using json.loads()
import json
import pandas as pd
with open('../data/yelp.json', 'r') as f:
    data = [json.loads(row) for row in f]

In [2]:
# show the first review
data[0]


Out[2]:
{u'business_id': u'9yKzy9PApeiPPOUJEtnvkg',
 u'date': u'2011-01-26',
 u'review_id': u'fWKvX83p0-ka4JS3dc6E5A',
 u'stars': 5,
 u'text': u'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!',
 u'type': u'review',
 u'user_id': u'rLtl8ZkDX5vH5nAx9C3q5Q',
 u'votes': {u'cool': 2, u'funny': 0, u'useful': 5}}

In [3]:
# convert the list of dictionaries to a DataFrame
df = pd.DataFrame.from_dict(data, orient='columns')

print(df)


                 business_id        date               review_id  stars  \
0     9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5   
1     ZRJwVLyzEJq1VAihDhYiow  2011-07-27  IjZ33sJrzXqU-0X6U8NwyA      5   
2     6oRAC4uyJCsJl1X0WZpVSA  2012-06-14  IESLBzqUCLdSzSqm0eCSxQ      4   
3     _1QQZuf4zZOyFCvXc0o6Vg  2010-05-27  G-WvGaISbqqaMHlNnByodA      5   
4     6ozycU1RpktNG2-1BroVtw  2012-01-05  1uJFq2r5QfJG_6ExMRCaGw      5   
5     -yxfBYGB6SEqszmxJxd97A  2007-12-13  m2CKSsepBCoRYWxiRUsxAg      4   
6     zp713qNhx8d9KCJJnrw1xA  2010-02-12  riFQ3vxNpP4rWLk_CSri2A      5   
7     hW0Ne_HTHEAgGF1rAdmR-g  2012-07-12  JL7GXJ9u4YMx7Rzs05NfiQ      4   
8     wNUea3IXZWD63bbOQaOH-g  2012-08-17  XtnfnYmnJYi71yIuGsXIUA      4   
9     nMHhuYan8e3cONo3PornJA  2010-08-11  jJAIXA46pU1swYyRCdfXtQ      5   
10    AsSCv0q_BWqIe3mX2JqsOQ  2010-06-16  E11jzpKz9Kw5K7fuARWfRw      5   
11    e9nN4XxjdHj4qtKCOPq_vg  2011-10-21  3rPt0LxF7rgmEUrznoH22w      5   
12    h53YuCiIDfEFSJCQpk8v1g  2010-01-11  cGnKNX3I9rthE0-TH24-qA      5   
13    WGNIYMeXPyoWav1APUq7jA  2011-12-23  FvEEw1_OsrYdvwLV5Hrliw      4   
14    yc5AH9H71xJidA_J2mChLA  2010-05-20  pfUwBKYYmUXeiwrhDluQcw      4   
15    Vb9FPCEL6Ly24PNxLBaAFw  2011-03-20  HvqmdqWcerVWO3Gs6zbrOw      2   
16    supigcPNO9IKo6olaTNV-g  2008-10-12  HXP_0Ul-FCmA4f-k9CqvaQ      3   
17    O510Re68mOy9dU490JTKCg  2010-05-03  j4SIzrIy0WrmW4yr4--Khg      5   
18    b5cEoKR8iQliq-yT2_O0LQ  2009-03-06  v0cTd3PNpYCkTyGKSpOfGA      3   
19    4JzzbSbK9wmlOBJZWYfuCg  2011-11-17  a0lCu-j2Sk_kHQsZi_eNgw      4   
20    8FNO4D3eozpIjj0k3q5Zbg  2008-10-08  MuqugTuR5DdIPcZ2IVP3aQ      3   
21    tdcjXyFLMKAsvRhURNOkCg  2011-06-28  LmuKVFh03Uz318VKnUWrxA      5   
22    eFA9dqXT5EA_TrMgbo03QQ  2011-07-13  CQYc8hgKxV4enApDkx0IhA      5   
23    IJ0o6b8bJFAbG6MjGfBebQ  2010-09-05  Dx9sfFU6Zn0GYOckijom-g      1   
24    JhupPnWfNlMJivnWB5druA  2011-05-22  cFtQnKzn2VDpBedy_TxlvA      5   
25    wzP2yNpV5p04nh0injjymA  2010-05-26  ChBeixVZerfFkeO0McdlbA      4   
26    qjmCVYkwP-HDa35jwYucbQ  2013-01-03  kZ4TzrVX6qeF0OvrVTGVEw      5   
27    wct7rZKyZqZftzmAU-vhWQ  2008-03-21  B5h25WK28rJjx4KHm4gr7g      4   
28    vz2zQQSjy-NnnKLZzjjoxA  2011-03-30  Y_ERKao0J5WsRiCtlKSNSA      4   
29    i213sY5rhkfCO8cD-FPr1A  2012-07-12  hre97jjSwon4bn1muHKOJg      4   
...                      ...         ...                     ...    ...   
9970  R6aazv8FB-6BeanY3ag8kw  2009-09-26  gP17ykqduf3AlewSaRb61w      5   
9971  JOZqBKIOB8WEBAWm7v1JFA  2008-07-22  QI9rfeWrZnvK5ojz8cEoRg      5   
9972  OllL0G9Kh_k1lx-2vrFDXQ  2012-10-23  U23UfuxN9DpAU0Dslc5KjQ      4   
9973  XHr5mXFgobOHoxbPJxmYdg  2009-09-28  udMiWjeG0OGcb4nNddDkBg      5   
9974  cdacUBBL2tDbDnB1EfhpQw  2009-12-16  bVU-_x9ijxjEImNluy84OA      2   
9975  EWMwV5V9BxNs_U6nNVMeqw  2007-10-20  g4LsVAoafmUDHiS-_yN4tA      5   
9976  iDYzGVIF1TDWdjHNgNjCVw  2009-09-11  bKjMcpNj0xSu2UI2EFQn1g      3   
9977  iDYzGVIF1TDWdjHNgNjCVw  2012-10-30  qaNZyCUJA6Yp0mvPBCknPQ      5   
9978  9Y3aQAVITkEJYe5vLZr13w  2010-04-01  ZoTUU6EJ1OBNr7mhqxHBLw      5   
9979  GV1P1x9eRb4iZHCxj5_IjA  2012-12-07  eVUs1C4yaVJNrc7SGTAheg      5   
9980  GHYOl_cnERMOhkCK_mGAlA  2011-07-03  Q-y3jSqccdytKxAyo1J0Xg      5   
9981  AX8lx9wHNYT45lyd7pxaYw  2008-11-27  IyunTh7jnG7v3EYwfF3hPw      5   
9982  KV-yJLmlODfUG1Mkds6kYw  2012-02-25  rIgZgxJPWTacq3mV6DfWfg      4   
9983  24V8QQWO6VaVggHdxjQQ_A  2010-06-06  PqiIeFOiVr-tj_FtHGAH2g      3   
9984  wepFVY82q_tuDzG6lQjHWw  2012-02-12  spusZYROtBKw_5tv3gYm4Q      1   
9985  EMGkbiCMfMTflQux-_JY7Q  2012-10-17  wB-f0xfx7WIyrOsRJMkDOg      4   
9986  oCA2OZcd_Jo_ggVmUx3WVw  2012-03-31  ijPZPKKWDqdWOIqYkUsJJw      4   
9987  r-a-Cn9hxdEnYTtVTB5bMQ  2012-04-07  j9HwZZoBBmJgOlqDSuJcxg      1   
9988  xY1sPHTA2RGVFlh5tZhs9g  2012-06-02  TM8hdYqs5Zi1jO5Yrq6E0g      4   
9989  mQUC-ATrFuMQSaDQb93Pug  2011-10-01  ta2P9joJqeFB8BzFp-AzjA      5   
9990  R8VwdLyvsp9iybNqRvm94g  2011-10-03  pcEeHdAJPoFNF23es0kKWg      5   
9991  WJ5mq4EiWYAA4Vif0xDfdg  2011-12-05  EuHX-39FR7tyyG1ElvN1Jw      5   
9992  f96lWMIAUhYIYy9gOktivQ  2009-03-10  YF17z7HWlMj6aezZc-pVEw      5   
9993  maB4VHseFUY2TmPtAQnB9Q  2011-06-27  SNnyYHI9rw9TTltVX3TF-A      4   
9994  L3BSpFvxcNf3T_teitgt6A  2012-03-19  0nxb1gIGFgk3WbC5zwhKZg      5   
9995  VY_tvNUCCXGXQeSvJl757Q  2012-07-28  Ubyfp2RSDYW0g7Mbr8N3iA      3   
9996  EKzMHI1tip8rC1-ZAy64yg  2012-01-18  2XyIOQKbVFb6uXQdJ0RzlQ      4   
9997  53YGfwmbW73JhFiemNeyzQ  2010-11-16  jyznYkIbpqVmlsZxSDSypA      4   
9998  9SKdOoDHcFoxK5ZtsgHJoA  2012-12-02  5UKq9WQE1qQbJ0DJbc-B6Q      2   
9999  pF7uRzygyZsltbmVpjIyvw  2010-10-16  vWSmOhg2ID1MNZHaWapGbA      5   

                                                   text    type  \
0     My wife took me here on my birthday for breakf...  review   
1     I have no idea why some people give bad review...  review   
2     love the gyro plate. Rice is so good and I als...  review   
3     Rosie, Dakota, and I LOVE Chaparral Dog Park!!...  review   
4     General Manager Scott Petello is a good egg!!!...  review   
5     Quiessence is, simply put, beautiful.  Full wi...  review   
6     Drop what you're doing and drive here. After I...  review   
7     Luckily, I didn't have to travel far to make m...  review   
8     Definitely come for Happy hour! Prices are ama...  review   
9     Nobuo shows his unique talents with everything...  review   
10    The oldish man who owns the store is as sweet ...  review   
11    Wonderful Vietnamese sandwich shoppe. Their ba...  review   
12    They have a limited time thing going on right ...  review   
13    Good tattoo shop. Clean space, multiple artist...  review   
14    I'm 2 weeks new to Phoenix. I looked up Irish ...  review   
15    Was it worth the 21$ for a salad and small piz...  review   
16    We went here on a Saturday afternoon and this ...  review   
17    okay this is the best place EVER! i grew up sh...  review   
18    I met a friend for lunch yesterday. \n\nLoved ...  review   
19    They've gotten better and better for me in the...  review   
20    DVAP....\n\nYou have to go at least once in yo...  review   
21    This place shouldn't even be reviewed - becaus...  review   
22    first time my friend and I went there... it wa...  review   
23    U can go there n check the car out. If u wanna...  review   
24    I love this place! I have been coming here for...  review   
25    This place is great.  A nice little ole' fashi...  review   
26    I love love LOVE this place. My boss (who is i...  review   
27    Not that my review will mean much given the mo...  review   
28    Came here for breakfast yesterday, it had been...  review   
29    Always reliably good.  Great beer selection as...  review   
...                                                 ...     ...   
9970  This place is super cute lunch joint.  I had t...  review   
9971  The staff is great, the food is great, even th...  review   
9972  Yay, even though I miss living in Coronado I a...  review   
9973  Wow!  Went on a Sunday around 11am - busy but ...  review   
9974  If Cowboy Ciao is the best restaurant in Scott...  review   
9975  When I lived in Phoenix, I was a regular at Fe...  review   
9976  I was looking for chile rellenos and this plac...  review   
9977  Why did I wait so long to try this neighborhoo...  review   
9978  This is the place for a fabulos breakfast!! I ...  review   
9979  Highly recommend. This is my second time here ...  review   
9980  5 stars for the great $5 happy hour specials. ...  review   
9981  We brought the entire family to Giuseppe's las...  review   
9982  Best corned beef sandwich I've had anywhere at...  review   
9983  3.5 stars. \n\nWe decided to check this place ...  review   
9984  Went last night to Whore Foods to get basics t...  review   
9985  Awesome food! Little pricey but delicious. Lov...  review   
9986  I came here in December and look forward to my...  review   
9987  The food is delicious.  The service:  discrimi...  review   
9988  For our first time we had a great time! Our se...  review   
9989  Great food and service! Country food at its best!  review   
9990  Yes I do rock the hipster joints.  I dig this ...  review   
9991  Only 4 stars? \n\n(A few notes: The folks that...  review   
9992  I'm not normally one to jump at reviewing a ch...  review   
9993  Judging by some of the reviews, maybe I went o...  review   
9994  Let's see...what is there NOT to like about Su...  review   
9995  First visit...Had lunch here today - used my G...  review   
9996  Should be called house of deliciousness!\n\nI ...  review   
9997  I recently visited Olive and Ivy for business ...  review   
9998  My nephew just moved to Scottsdale recently so...  review   
9999  4-5 locations.. all 4.5 star average.. I think...  review   

                     user_id                                     votes  
0     rLtl8ZkDX5vH5nAx9C3q5Q   {u'funny': 0, u'useful': 5, u'cool': 2}  
1     0a2KyEL0d3Yb1V6aivbIuQ   {u'funny': 0, u'useful': 0, u'cool': 0}  
2     0hT2KtfLiobPvh6cDC8JQg   {u'funny': 0, u'useful': 1, u'cool': 0}  
3     uZetl9T0NcROGOyFfughhg   {u'funny': 0, u'useful': 2, u'cool': 1}  
4     vYmM4KTsC8ZfQBg-j5MWkw   {u'funny': 0, u'useful': 0, u'cool': 0}  
5     sqYN3lNgvPbPCTRsMFu27g   {u'funny': 1, u'useful': 3, u'cool': 4}  
6     wFweIWhv2fREZV_dYkz_1g   {u'funny': 4, u'useful': 7, u'cool': 7}  
7     1ieuYcKS7zeAv_U15AB13A   {u'funny': 0, u'useful': 1, u'cool': 0}  
8     Vh_DlizgGhSqQh4qfZ2h6A   {u'funny': 0, u'useful': 0, u'cool': 0}  
9     sUNkXg8-KFtCMQDV6zRzQg   {u'funny': 0, u'useful': 1, u'cool': 0}  
10    -OMlS6yWkYjVldNhC31wYg   {u'funny': 1, u'useful': 3, u'cool': 1}  
11    C1rHp3dmepNea7XiouwB6Q   {u'funny': 0, u'useful': 1, u'cool': 1}  
12    UPtysDF6cUDUxq2KY-6Dcg   {u'funny': 0, u'useful': 2, u'cool': 1}  
13    Xm8HXE1JHqscXe5BKf0GFQ   {u'funny': 0, u'useful': 2, u'cool': 1}  
14    JOG-4G4e8ae3lx_szHtR8g   {u'funny': 0, u'useful': 1, u'cool': 1}  
15    ylWOj2y7TV2e3yYeWhu2QA   {u'funny': 0, u'useful': 2, u'cool': 0}  
16    SBbftLzfYYKItOMFwOTIJg   {u'funny': 2, u'useful': 4, u'cool': 3}  
17    u1KWcbPMvXFEEYkZZ0Yktg   {u'funny': 0, u'useful': 0, u'cool': 0}  
18    UsULgP4bKA8RMzs8dQzcsA   {u'funny': 4, u'useful': 6, u'cool': 5}  
19    nDBly08j5URmrHQ2JCbyiw   {u'funny': 1, u'useful': 1, u'cool': 1}  
20    C6IOtaaYdLIT5fWd7ZYIuA   {u'funny': 1, u'useful': 4, u'cool': 2}  
21    YN3ZLOdg8kpnfbVcIhuEZA   {u'funny': 2, u'useful': 1, u'cool': 1}  
22    6lg55RIP23VhjYEBXJ8Njw   {u'funny': 0, u'useful': 0, u'cool': 0}  
23    zRlQEDYd_HKp0VS3hnAffA   {u'funny': 1, u'useful': 1, u'cool': 0}  
24    13xj6FSvYO0rZVRv5XZp4w   {u'funny': 0, u'useful': 1, u'cool': 0}  
25    rLtl8ZkDX5vH5nAx9C3q5Q   {u'funny': 0, u'useful': 0, u'cool': 0}  
26    fpItLlgimq0nRltWOkuJJw   {u'funny': 0, u'useful': 0, u'cool': 0}  
27    RRTraCQw77EU4yZh0BBTag   {u'funny': 1, u'useful': 4, u'cool': 2}  
28    EP3cGJvYiuOwumerwADplg   {u'funny': 1, u'useful': 1, u'cool': 1}  
29    kpbhy1zPewGDmdNfNqQp-g   {u'funny': 0, u'useful': 1, u'cool': 0}  
...                      ...                                       ...  
9970  mtoKqaQjGPWEc5YZbrYV9w   {u'funny': 0, u'useful': 0, u'cool': 0}  
9971  uBAMd01ZtGXaHrRD6THNzg   {u'funny': 1, u'useful': 2, u'cool': 1}  
9972  Gh1EXuS42DY3rV_MzFpJpg   {u'funny': 0, u'useful': 0, u'cool': 0}  
9973  yRYNx24kUDRRBfJu1Rcojg   {u'funny': 0, u'useful': 0, u'cool': 0}  
9974  V9Uqt00HXwXT6mzsVCjMAw   {u'funny': 0, u'useful': 0, u'cool': 0}  
9975  TLj3XaclA7V4ldJ5yNP-9Q   {u'funny': 0, u'useful': 1, u'cool': 1}  
9976  2tUCLMHQKz4kA1VlRB_w0Q   {u'funny': 0, u'useful': 0, u'cool': 0}  
9977  Id-8-NMEKxeXBR44eUdDeA   {u'funny': 3, u'useful': 6, u'cool': 3}  
9978  vasHsAZEgLZGJDTlIweUYQ   {u'funny': 0, u'useful': 1, u'cool': 0}  
9979  bJFdmJJxfXgCYA5DMmyeqQ   {u'funny': 1, u'useful': 2, u'cool': 2}  
9980  xZvRLPJ1ixhFVomkXSfXAw   {u'funny': 4, u'useful': 6, u'cool': 6}  
9981  fczQCSmaWF78toLEmb0Zsw  {u'funny': 5, u'useful': 9, u'cool': 10}  
9982  J-oVr0th2Y7ltPPOwy0Z8Q   {u'funny': 0, u'useful': 0, u'cool': 0}  
9983  LaEj3VpQh7bgpAZLzSRRrw   {u'funny': 1, u'useful': 4, u'cool': 1}  
9984  W7zmm1uzlyUkEqpSG7PlBw   {u'funny': 2, u'useful': 1, u'cool': 0}  
9985  9MJAacmjxtctbI3xncsK5Q   {u'funny': 0, u'useful': 0, u'cool': 0}  
9986  yzwPJdn6yd2ccZqfy4LhUA   {u'funny': 0, u'useful': 0, u'cool': 0}  
9987  toPtsUtYoRB-5-ThrOy2Fg   {u'funny': 0, u'useful': 0, u'cool': 0}  
9988  GvaNZY4poCcd3H4WxHjrLQ   {u'funny': 0, u'useful': 2, u'cool': 0}  
9989  fKaO8fR1IAcfvZb6cBrs2w   {u'funny': 0, u'useful': 1, u'cool': 0}  
9990  b92Y3tyWTQQZ5FLifex62Q   {u'funny': 1, u'useful': 1, u'cool': 1}  
9991  hTau-iNZFwoNsPCaiIUTEA   {u'funny': 0, u'useful': 1, u'cool': 1}  
9992  W_QXYA7A0IhMrvbckz7eVg   {u'funny': 2, u'useful': 3, u'cool': 2}  
9993  T46gxPbJMWmlLyr7GxQLyQ   {u'funny': 0, u'useful': 1, u'cool': 1}  
9994  OzOZv-Knlw3oz9K5Kh5S6A   {u'funny': 1, u'useful': 2, u'cool': 1}  
9995  _eqQoPtQ3e3UxLE4faT6ow   {u'funny': 0, u'useful': 2, u'cool': 1}  
9996  ROru4uk5SaYc3rg8IU7SQw   {u'funny': 0, u'useful': 0, u'cool': 0}  
9997  gGbN1aKQHMgfQZkqlsuwzg   {u'funny': 0, u'useful': 0, u'cool': 0}  
9998  0lyVoNazXa20WzUyZPLaQQ   {u'funny': 0, u'useful': 0, u'cool': 0}  
9999  KSBFytcdjPKZgXKQnYQdkA   {u'funny': 0, u'useful': 0, u'cool': 0}  

[10000 rows x 8 columns]

In [4]:
# add DataFrame columns for cool, useful, and funny

df['cool'] = [row['votes']['cool'] for row in data]
df['useful'] = [row['votes']['useful'] for row in data]
df['funny'] = [row['votes']['funny'] for row in data]

In [5]:
# drop the votes column, which is redundant now that each vote type has its own column

df.drop('votes', axis=1, inplace=True)
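As an alternative to building the three vote columns with list comprehensions, pandas can flatten the nested votes dictionary in one step. The sketch below is not part of the original solution: it assumes a pandas version that provides pd.json_normalize (1.0 or later) and uses two made-up records shaped like the rows of yelp.json.

```python
import pandas as pd

# two made-up records with the same nested structure as yelp.json rows
records = [
    {'review_id': 'a1', 'stars': 5, 'votes': {'cool': 2, 'funny': 0, 'useful': 5}},
    {'review_id': 'b2', 'stars': 3, 'votes': {'cool': 0, 'funny': 1, 'useful': 1}},
]

# json_normalize flattens nested dictionaries into columns named 'votes.cool', etc.
flat = pd.json_normalize(records)

# strip the 'votes.' prefix so the column names match those in yelp.csv
flat = flat.rename(columns=lambda name: name.replace('votes.', ''))

print(flat[['review_id', 'stars', 'cool', 'useful', 'funny']])
```

With this approach there is no separate votes column left to drop afterwards.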

Task 2

Explore the relationship between each of the vote types (cool/useful/funny) and the number of stars.


In [6]:
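# treat stars as a categorical variable and look for differences between groups by comparing the means of the groups
# (the vote columns are selected explicitly so that text columns are not averaged)
df.groupby('stars')[['cool', 'useful', 'funny']].mean()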


Out[6]:
cool useful funny
stars
1 0.576769 1.604806 1.056075
2 0.719525 1.563107 0.875944
3 0.788501 1.306639 0.694730
4 0.954623 1.395916 0.670448
5 0.944261 1.381780 0.608631

In [7]:
# display a correlation matrix of the vote types (cool/useful/funny) and stars
%matplotlib inline
import seaborn as sns
sns.heatmap(df[['stars', 'cool', 'useful', 'funny']].corr())


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0xca5e4a8>
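
A heatmap shows only colors unless annotations are turned on, so it can help to print the correlation matrix as numbers as well. A minimal sketch, using a small made-up DataFrame in place of the Yelp data (the values are illustrative only):

```python
import pandas as pd

# made-up stand-in for the numeric columns of the Yelp DataFrame
toy = pd.DataFrame({
    'stars':  [5, 4, 1, 3, 5, 2],
    'cool':   [2, 1, 0, 1, 3, 0],
    'useful': [5, 1, 2, 1, 3, 2],
    'funny':  [0, 0, 3, 1, 1, 2],
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))

# correlation of each vote type with stars, sorted from lowest to highest
print(corr['stars'].drop('stars').sort_values())
```

Passing annot=True to sns.heatmap would print the same numbers inside the cells of the plot.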

In [8]:
# display a scatter plot with a linear regression line for each vote type (cool, funny, useful)
import matplotlib.pyplot as plt

sns.lmplot(x='cool', y='stars', data=df, ci=95, fit_reg=True)
plt.xlim(-1, 90)
plt.ylim(-1, 10)

sns.lmplot(x='funny', y='stars', data=df, ci=95, fit_reg=True)
plt.xlim(-1, 90)
plt.ylim(-1, 10)

sns.lmplot(x='useful', y='stars', data=df, ci=95, fit_reg=True)
plt.xlim(-1, 90)
plt.ylim(-1, 10)



Out[8]:
(-1, 10)

Task 3

Define cool/useful/funny as the feature matrix X, and stars as the response vector y.


In [9]:
# feature matrix
feature_cols = ['cool', 'useful', 'funny']
X = df[feature_cols]

# response vector (selecting with a list keeps y as a one-column DataFrame)
response_vector = ['stars']
y = df[response_vector]

Task 4

Fit a linear regression model and interpret the coefficients. Do the coefficients make intuitive sense to you? Explore the Yelp website to see if you detect similar trends.


In [10]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X, y)

print(reg.intercept_)
print(reg.coef_)


[ 3.83989479]
[[ 0.27435947 -0.14745239 -0.13567449]]
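
To interpret these numbers, recall that the coefficients are listed in the order of feature_cols (cool, useful, funny). Holding the other vote counts fixed, each additional "cool" vote is associated with about 0.27 more stars, while each additional "useful" or "funny" vote is associated with about 0.15 or 0.14 fewer stars. The sketch below plugs the printed values into the regression equation by hand; the helper name predict_stars is introduced here for illustration, and the example vote counts (2 cool, 5 useful, 0 funny) are those of the first review in the data.

```python
# fitted values copied from the printed intercept and coefficients
intercept = 3.83989479
coefs = {'cool': 0.27435947, 'useful': -0.14745239, 'funny': -0.13567449}

def predict_stars(votes):
    """Return intercept + sum(coefficient * vote count) for one review."""
    return intercept + sum(coefs[name] * votes[name] for name in coefs)

# vote counts of the first review in the dataset
first_review = {'cool': 2, 'useful': 5, 'funny': 0}

# roughly 3.65 predicted stars, whereas this review actually gave 5 stars
print(predict_stars(first_review))
```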

Task 5

Evaluate the model by splitting the data into training and testing sets and computing the RMSE on the testing set. Does the RMSE make intuitive sense to you?


In [11]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np

In [12]:
# define a function that accepts a list of features and returns testing RMSE
def rmse_train_test(feature_cols):
    # build the feature matrix and response inside the function
    X = df[feature_cols]
    y = df[response_vector]
    # fixed random_state so that every feature set is scored on the same split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [17]:
# calculate RMSE with all three features
rmse_train_test(['cool', 'useful', 'funny'])


Out[17]:
1.1979297318530078
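
For intuition about what this number measures: the RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the response (stars). A small hand-computed sketch with made-up ratings and predictions:

```python
import math

# made-up actual ratings and predictions, for illustration only
y_true = [5, 4, 3, 1]
y_pred = [4.0, 4.0, 4.0, 4.0]

# squared error for each observation
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# mean of the squared errors, then the square root
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)
```

Read this way, an RMSE of about 1.2 suggests that predictions are typically off by somewhat more than one star on a five-star scale.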

Task 6

Try removing some of the features and see if the RMSE improves.


In [22]:
print("RMSE for funny feature:")
print(rmse_train_test(['funny']))


print("RMSE for cool feature:")
print(rmse_train_test(['cool']))


print("RMSE for useful feature:")
print(rmse_train_test(['useful']))


#cool and useful seem to be the features that work best
print("RMSE for cool and useful feature:")
print(rmse_train_test(['cool', 'useful']))


print("RMSE for cool and funny feature:")
print(rmse_train_test(['cool', 'funny']))


print("RMSE for funny and useful feature:")
print(rmse_train_test(['funny', 'useful']))


print("RMSE for all feature:")
print(rmse_train_test(['cool', 'useful', 'funny']))


RMSE for funny feature:
1.20619614865
RMSE for cool feature:
1.20616834353
RMSE for useful feature:
1.21090710448
RMSE for cool and useful feature:
1.19740928009
RMSE for cool and funny feature:
1.20877292668
RMSE for funny and useful feature:
1.20487286473
RMSE for all feature:
1.19792973185
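
Typing each combination by hand gets tedious as the number of features grows. One possible shortcut, sketched below, is to enumerate the subsets with itertools.combinations; each subset could then be passed to rmse_train_test as a list. The snippet only generates and prints the subsets, so it runs without the data.

```python
from itertools import combinations

features = ['cool', 'useful', 'funny']

# every non-empty subset of the features, from single features up to all three
subsets = [list(combo)
           for size in range(1, len(features) + 1)
           for combo in combinations(features, size)]

for subset in subsets:
    # with the data loaded, this could instead be: print(subset, rmse_train_test(subset))
    print(subset)
```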

Task 7 (Bonus)

Think of some new features you could create from the existing data that might be predictive of the response. Figure out how to create those features in Pandas, add them to your model, and see if the RMSE improves.


In [24]:
# new feature: total number of votes (cool + useful + funny) received by each review
df['sum_votes'] = df['cool'] + df['useful'] + df['funny']
df


Out[24]:
business_id date review_id stars text type user_id cool useful funny sum_votes
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 7
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0 1
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review uZetl9T0NcROGOyFfughhg 1 2 0 3
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!... review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 0
5 -yxfBYGB6SEqszmxJxd97A 2007-12-13 m2CKSsepBCoRYWxiRUsxAg 4 Quiessence is, simply put, beautiful. Full wi... review sqYN3lNgvPbPCTRsMFu27g 4 3 1 8
6 zp713qNhx8d9KCJJnrw1xA 2010-02-12 riFQ3vxNpP4rWLk_CSri2A 5 Drop what you're doing and drive here. After I... review wFweIWhv2fREZV_dYkz_1g 7 7 4 18
7 hW0Ne_HTHEAgGF1rAdmR-g 2012-07-12 JL7GXJ9u4YMx7Rzs05NfiQ 4 Luckily, I didn't have to travel far to make m... review 1ieuYcKS7zeAv_U15AB13A 0 1 0 1
8 wNUea3IXZWD63bbOQaOH-g 2012-08-17 XtnfnYmnJYi71yIuGsXIUA 4 Definitely come for Happy hour! Prices are ama... review Vh_DlizgGhSqQh4qfZ2h6A 0 0 0 0
9 nMHhuYan8e3cONo3PornJA 2010-08-11 jJAIXA46pU1swYyRCdfXtQ 5 Nobuo shows his unique talents with everything... review sUNkXg8-KFtCMQDV6zRzQg 0 1 0 1
10 AsSCv0q_BWqIe3mX2JqsOQ 2010-06-16 E11jzpKz9Kw5K7fuARWfRw 5 The oldish man who owns the store is as sweet ... review -OMlS6yWkYjVldNhC31wYg 1 3 1 5
11 e9nN4XxjdHj4qtKCOPq_vg 2011-10-21 3rPt0LxF7rgmEUrznoH22w 5 Wonderful Vietnamese sandwich shoppe. Their ba... review C1rHp3dmepNea7XiouwB6Q 1 1 0 2
12 h53YuCiIDfEFSJCQpk8v1g 2010-01-11 cGnKNX3I9rthE0-TH24-qA 5 They have a limited time thing going on right ... review UPtysDF6cUDUxq2KY-6Dcg 1 2 0 3
13 WGNIYMeXPyoWav1APUq7jA 2011-12-23 FvEEw1_OsrYdvwLV5Hrliw 4 Good tattoo shop. Clean space, multiple artist... review Xm8HXE1JHqscXe5BKf0GFQ 1 2 0 3
14 yc5AH9H71xJidA_J2mChLA 2010-05-20 pfUwBKYYmUXeiwrhDluQcw 4 I'm 2 weeks new to Phoenix. I looked up Irish ... review JOG-4G4e8ae3lx_szHtR8g 1 1 0 2
15 Vb9FPCEL6Ly24PNxLBaAFw 2011-03-20 HvqmdqWcerVWO3Gs6zbrOw 2 Was it worth the 21$ for a salad and small piz... review ylWOj2y7TV2e3yYeWhu2QA 0 2 0 2
16 supigcPNO9IKo6olaTNV-g 2008-10-12 HXP_0Ul-FCmA4f-k9CqvaQ 3 We went here on a Saturday afternoon and this ... review SBbftLzfYYKItOMFwOTIJg 3 4 2 9
17 O510Re68mOy9dU490JTKCg 2010-05-03 j4SIzrIy0WrmW4yr4--Khg 5 okay this is the best place EVER! i grew up sh... review u1KWcbPMvXFEEYkZZ0Yktg 0 0 0 0
18 b5cEoKR8iQliq-yT2_O0LQ 2009-03-06 v0cTd3PNpYCkTyGKSpOfGA 3 I met a friend for lunch yesterday. \n\nLoved ... review UsULgP4bKA8RMzs8dQzcsA 5 6 4 15
19 4JzzbSbK9wmlOBJZWYfuCg 2011-11-17 a0lCu-j2Sk_kHQsZi_eNgw 4 They've gotten better and better for me in the... review nDBly08j5URmrHQ2JCbyiw 1 1 1 3
20 8FNO4D3eozpIjj0k3q5Zbg 2008-10-08 MuqugTuR5DdIPcZ2IVP3aQ 3 DVAP....\n\nYou have to go at least once in yo... review C6IOtaaYdLIT5fWd7ZYIuA 2 4 1 7
21 tdcjXyFLMKAsvRhURNOkCg 2011-06-28 LmuKVFh03Uz318VKnUWrxA 5 This place shouldn't even be reviewed - becaus... review YN3ZLOdg8kpnfbVcIhuEZA 1 1 2 4
22 eFA9dqXT5EA_TrMgbo03QQ 2011-07-13 CQYc8hgKxV4enApDkx0IhA 5 first time my friend and I went there... it wa... review 6lg55RIP23VhjYEBXJ8Njw 0 0 0 0
23 IJ0o6b8bJFAbG6MjGfBebQ 2010-09-05 Dx9sfFU6Zn0GYOckijom-g 1 U can go there n check the car out. If u wanna... review zRlQEDYd_HKp0VS3hnAffA 0 1 1 2
24 JhupPnWfNlMJivnWB5druA 2011-05-22 cFtQnKzn2VDpBedy_TxlvA 5 I love this place! I have been coming here for... review 13xj6FSvYO0rZVRv5XZp4w 0 1 0 1
25 wzP2yNpV5p04nh0injjymA 2010-05-26 ChBeixVZerfFkeO0McdlbA 4 This place is great. A nice little ole' fashi... review rLtl8ZkDX5vH5nAx9C3q5Q 0 0 0 0
26 qjmCVYkwP-HDa35jwYucbQ 2013-01-03 kZ4TzrVX6qeF0OvrVTGVEw 5 I love love LOVE this place. My boss (who is i... review fpItLlgimq0nRltWOkuJJw 0 0 0 0
27 wct7rZKyZqZftzmAU-vhWQ 2008-03-21 B5h25WK28rJjx4KHm4gr7g 4 Not that my review will mean much given the mo... review RRTraCQw77EU4yZh0BBTag 2 4 1 7
28 vz2zQQSjy-NnnKLZzjjoxA 2011-03-30 Y_ERKao0J5WsRiCtlKSNSA 4 Came here for breakfast yesterday, it had been... review EP3cGJvYiuOwumerwADplg 1 1 1 3
29 i213sY5rhkfCO8cD-FPr1A 2012-07-12 hre97jjSwon4bn1muHKOJg 4 Always reliably good. Great beer selection as... review kpbhy1zPewGDmdNfNqQp-g 0 1 0 1
... ... ... ... ... ... ... ... ... ... ... ...
9970 R6aazv8FB-6BeanY3ag8kw 2009-09-26 gP17ykqduf3AlewSaRb61w 5 This place is super cute lunch joint. I had t... review mtoKqaQjGPWEc5YZbrYV9w 0 0 0 0
9971 JOZqBKIOB8WEBAWm7v1JFA 2008-07-22 QI9rfeWrZnvK5ojz8cEoRg 5 The staff is great, the food is great, even th... review uBAMd01ZtGXaHrRD6THNzg 1 2 1 4
9972 OllL0G9Kh_k1lx-2vrFDXQ 2012-10-23 U23UfuxN9DpAU0Dslc5KjQ 4 Yay, even though I miss living in Coronado I a... review Gh1EXuS42DY3rV_MzFpJpg 0 0 0 0
9973 XHr5mXFgobOHoxbPJxmYdg 2009-09-28 udMiWjeG0OGcb4nNddDkBg 5 Wow! Went on a Sunday around 11am - busy but ... review yRYNx24kUDRRBfJu1Rcojg 0 0 0 0
9974 cdacUBBL2tDbDnB1EfhpQw 2009-12-16 bVU-_x9ijxjEImNluy84OA 2 If Cowboy Ciao is the best restaurant in Scott... review V9Uqt00HXwXT6mzsVCjMAw 0 0 0 0
9975 EWMwV5V9BxNs_U6nNVMeqw 2007-10-20 g4LsVAoafmUDHiS-_yN4tA 5 When I lived in Phoenix, I was a regular at Fe... review TLj3XaclA7V4ldJ5yNP-9Q 1 1 0 2
9976 iDYzGVIF1TDWdjHNgNjCVw 2009-09-11 bKjMcpNj0xSu2UI2EFQn1g 3 I was looking for chile rellenos and this plac... review 2tUCLMHQKz4kA1VlRB_w0Q 0 0 0 0
9977 iDYzGVIF1TDWdjHNgNjCVw 2012-10-30 qaNZyCUJA6Yp0mvPBCknPQ 5 Why did I wait so long to try this neighborhoo... review Id-8-NMEKxeXBR44eUdDeA 3 6 3 12
9978 9Y3aQAVITkEJYe5vLZr13w 2010-04-01 ZoTUU6EJ1OBNr7mhqxHBLw 5 This is the place for a fabulos breakfast!! I ... review vasHsAZEgLZGJDTlIweUYQ 0 1 0 1
9979 GV1P1x9eRb4iZHCxj5_IjA 2012-12-07 eVUs1C4yaVJNrc7SGTAheg 5 Highly recommend. This is my second time here ... review bJFdmJJxfXgCYA5DMmyeqQ 2 2 1 5
9980 GHYOl_cnERMOhkCK_mGAlA 2011-07-03 Q-y3jSqccdytKxAyo1J0Xg 5 5 stars for the great $5 happy hour specials. ... review xZvRLPJ1ixhFVomkXSfXAw 6 6 4 16
9981 AX8lx9wHNYT45lyd7pxaYw 2008-11-27 IyunTh7jnG7v3EYwfF3hPw 5 We brought the entire family to Giuseppe's las... review fczQCSmaWF78toLEmb0Zsw 10 9 5 24
9982 KV-yJLmlODfUG1Mkds6kYw 2012-02-25 rIgZgxJPWTacq3mV6DfWfg 4 Best corned beef sandwich I've had anywhere at... review J-oVr0th2Y7ltPPOwy0Z8Q 0 0 0 0
9983 24V8QQWO6VaVggHdxjQQ_A 2010-06-06 PqiIeFOiVr-tj_FtHGAH2g 3 3.5 stars. \n\nWe decided to check this place ... review LaEj3VpQh7bgpAZLzSRRrw 1 4 1 6
9984 wepFVY82q_tuDzG6lQjHWw 2012-02-12 spusZYROtBKw_5tv3gYm4Q 1 Went last night to Whore Foods to get basics t... review W7zmm1uzlyUkEqpSG7PlBw 0 1 2 3
9985 EMGkbiCMfMTflQux-_JY7Q 2012-10-17 wB-f0xfx7WIyrOsRJMkDOg 4 Awesome food! Little pricey but delicious. Lov... review 9MJAacmjxtctbI3xncsK5Q 0 0 0 0
9986 oCA2OZcd_Jo_ggVmUx3WVw 2012-03-31 ijPZPKKWDqdWOIqYkUsJJw 4 I came here in December and look forward to my... review yzwPJdn6yd2ccZqfy4LhUA 0 0 0 0
9987 r-a-Cn9hxdEnYTtVTB5bMQ 2012-04-07 j9HwZZoBBmJgOlqDSuJcxg 1 The food is delicious. The service: discrimi... review toPtsUtYoRB-5-ThrOy2Fg 0 0 0 0
9988 xY1sPHTA2RGVFlh5tZhs9g 2012-06-02 TM8hdYqs5Zi1jO5Yrq6E0g 4 For our first time we had a great time! Our se... review GvaNZY4poCcd3H4WxHjrLQ 0 2 0 2
9989 mQUC-ATrFuMQSaDQb93Pug 2011-10-01 ta2P9joJqeFB8BzFp-AzjA 5 Great food and service! Country food at its best! review fKaO8fR1IAcfvZb6cBrs2w 0 1 0 1
9990 R8VwdLyvsp9iybNqRvm94g 2011-10-03 pcEeHdAJPoFNF23es0kKWg 5 Yes I do rock the hipster joints. I dig this ... review b92Y3tyWTQQZ5FLifex62Q 1 1 1 3
9991 WJ5mq4EiWYAA4Vif0xDfdg 2011-12-05 EuHX-39FR7tyyG1ElvN1Jw 5 Only 4 stars? \n\n(A few notes: The folks that... review hTau-iNZFwoNsPCaiIUTEA 1 1 0 2
9992 f96lWMIAUhYIYy9gOktivQ 2009-03-10 YF17z7HWlMj6aezZc-pVEw 5 I'm not normally one to jump at reviewing a ch... review W_QXYA7A0IhMrvbckz7eVg 2 3 2 7
9993 maB4VHseFUY2TmPtAQnB9Q 2011-06-27 SNnyYHI9rw9TTltVX3TF-A 4 Judging by some of the reviews, maybe I went o... review T46gxPbJMWmlLyr7GxQLyQ 1 1 0 2
9994 L3BSpFvxcNf3T_teitgt6A 2012-03-19 0nxb1gIGFgk3WbC5zwhKZg 5 Let's see...what is there NOT to like about Su... review OzOZv-Knlw3oz9K5Kh5S6A 1 2 1 4
9995 VY_tvNUCCXGXQeSvJl757Q 2012-07-28 Ubyfp2RSDYW0g7Mbr8N3iA 3 First visit...Had lunch here today - used my G... review _eqQoPtQ3e3UxLE4faT6ow 1 2 0 3
9996 EKzMHI1tip8rC1-ZAy64yg 2012-01-18 2XyIOQKbVFb6uXQdJ0RzlQ 4 Should be called house of deliciousness!\n\nI ... review ROru4uk5SaYc3rg8IU7SQw 0 0 0 0
9997 53YGfwmbW73JhFiemNeyzQ 2010-11-16 jyznYkIbpqVmlsZxSDSypA 4 I recently visited Olive and Ivy for business ... review gGbN1aKQHMgfQZkqlsuwzg 0 0 0 0
9998 9SKdOoDHcFoxK5ZtsgHJoA 2012-12-02 5UKq9WQE1qQbJ0DJbc-B6Q 2 My nephew just moved to Scottsdale recently so... review 0lyVoNazXa20WzUyZPLaQQ 0 0 0 0
9999 pF7uRzygyZsltbmVpjIyvw 2010-10-16 vWSmOhg2ID1MNZHaWapGbA 5 4-5 locations.. all 4.5 star average.. I think... review KSBFytcdjPKZgXKQnYQdkA 0 0 0 0

10000 rows × 11 columns


In [25]:
# feature list extended with the new sum_votes column
feature_cols = ['cool', 'useful', 'funny', 'sum_votes']
X = df[feature_cols]

In [26]:
# add new features to the model and calculate RMSE
print("RMSE for total votes:")
print(rmse_train_test(['sum_votes']))


RMSE for total votes:
1.20959949464
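
Because sum_votes is just the sum of the three existing vote columns, a linear model that already includes those columns gains no new information from it. Features drawn from other columns may be more promising. The sketch below shows two hypothetical candidates, text_length and year (both names are illustrative and not part of the original solution), built on a made-up two-row DataFrame that mimics the text and date columns.

```python
import pandas as pd

# made-up rows with the same 'text' and 'date' columns as the Yelp data
toy = pd.DataFrame({
    'text': ['Great food and service!', 'Too slow. Would not return.'],
    'date': ['2011-10-01', '2012-04-07'],
})

# hypothetical feature: number of characters in the review text
toy['text_length'] = toy['text'].str.len()

# hypothetical feature: year in which the review was written
toy['year'] = pd.to_datetime(toy['date']).dt.year

print(toy[['text_length', 'year']])
```

Whether either feature actually lowers the testing RMSE would need to be checked with rmse_train_test on the real data.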

Task 8 (Bonus)

Compare your best RMSE on the testing set with the RMSE for the "null model", which is the model that ignores all features and simply predicts the mean response value in the testing set.


In [61]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)

# null model: predict the mean of y_test (3.7732) for every testing observation
# np.full builds the constant array directly; fillna() only replaces NaN,
# so it would leave a zero-filled array unchanged
y_null = np.full(y_test.shape, y_test.values.mean())

# compute null RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_null))


Out[61]:
1.2090334
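
A useful property of the null model is that its RMSE equals the population standard deviation of the testing responses, because it predicts their mean for every observation. A quick check on made-up ratings, assuming only NumPy (the variable names are illustrative):

```python
import numpy as np

# made-up test-set ratings, for illustration only
y_test_toy = np.array([5, 4, 3, 1, 5, 2], dtype=float)

# null model: predict the mean of the test responses for every observation
y_null_toy = np.full(y_test_toy.shape, y_test_toy.mean())

# RMSE of the null predictions
null_rmse = np.sqrt(np.mean((y_test_toy - y_null_toy) ** 2))

# np.std uses the population formula (ddof=0) by default, so the two values agree
print(null_rmse, np.std(y_test_toy))
```

A model is only worth keeping if its testing RMSE is clearly below this baseline.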

In [ ]: